ABSTRACT

Segregating an audio mixture containing multiple simultane- 
ous bird sounds is a challenging task. However, birdsong of- 
ten contains rapid pitch modulations, and these modulations 
carry information which may be of use in automatic recog- 
nition. In this paper we demonstrate that an improved spec- 
trogram representation, based on the distribution derivative 
method, leads to improved performance of a segregation al- 
gorithm which uses a Markov renewal process model to track 
vocalisation patterns consisting of singing and silences. 

Index Terms — birdsong, Markov renewal process, multi- 
ple tracking, distribution derivative method, reassignment 

1. INTRODUCTION 

Machine recognition of animal sounds is of growing importance in
bioacoustics and ecology, as a tool that can facilitate unattended
monitoring, citizen science, and other applications with large volumes
of audio data [1, 2]. For birdsong, tasks which have been studied
include recognition of species [3] and individuals [4, 5]. However,
much research considers only the monophonic case, using recordings of
single birds, either isolated or with low background interference. It
is important to develop techniques applicable to mixtures of singing
birds: not only because singing often occurs within flocks or dawn
choruses, but also because there is research interest in analysing
ensemble singing [6] and in non-invasively characterising a population
[7]. The automatic recognition literature has only just begun to
approach such polyphonic tasks [8].

In the present work we focus on the task of analysing a recording
containing multiple birds of the same species (e.g. a recording of a
flock), and identifying the streams of syllables that correspond to a
single bird. From the perspective of computational auditory scene
analysis this task of clustering sounds is analogous to the well-known
"cocktail party problem" in perception [9]. We consider the task
recently studied in [10], where a probabilistic model is developed that
can segregate such sequences of sound events modelled as point
processes. In that work, it was observed that the quality of the
initial detection stage (used to locate individual syllables) when
applied to audio mixtures can be a strong limiting factor on the
quality of the tracking. In this paper we work within the same paradigm
and demonstrate that improvements to the underlying representation
yield improved quality of tracking.

In [11] it was observed that birdsong contains very rapid modulations,
and that using a chirplet representation instead of standard spectral
magnitudes could lead to improved recognition performance by making use
of low-level modulation information. The technique described in that
paper used a simple dictionary of chirplets to analyse a signal.
However, powerful parametric techniques exist to estimate the
characteristics of non-stationary signals and may be well-suited to
this task. The generalised reassignment method (GRM) [12] has been
shown to work well for this even when dealing with extreme frequency
and amplitude modulations [13]. However, difficulties arise as the
linear system of equations for a third-degree GRM becomes
ill-conditioned. A related method, the distribution derivative method
(DDM) [14], circumvents this. In addition, the DDM can examine a
frequency range rather than just a single frequency, which is useful
when a highly modulated sinusoid is assumed to occupy a significant
portion of the spectrum rather than being concentrated around the peak
frequency.

Such techniques have not yet been widely tested in prac- 
tical applications. In the present work we demonstrate that 
the refined representation derived from the DDM leads to im- 
proved tracking of multiple singing birds. In the remainder 
of this paper, we will give an overview of the DDM, and the 
particular variant of the technique developed for the present 
study. We will then describe the multiple tracking technique 
used to infer the sequence structure contained within a record- 
ing of multiple birds. We will apply this tracking procedure 
to a dataset of birdsong recordings, analysed via either a stan- 
dard spectrogram or the DDM, showing that the improved 
spectral representation is of benefit to downstream analysis. 

2. DISTRIBUTION DERIVATIVE METHOD 

The essence of the DDM lies in a simple but powerful concept: the
distribution derivative rule. Considering an arbitrary distribution $x$
and a test function $\psi$, integration by parts applied to the inner
product gives:

$$\langle x', \psi \rangle = -\langle x, \psi' \rangle. \tag{1}$$



Treating the signal under study as a distribution, the following
equality can be obtained using the above:

$$\langle s', w e^{j\omega t} \rangle = -\langle s, w' e^{j\omega t} \rangle + j\omega \langle s, w e^{j\omega t} \rangle, \tag{2}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product, $w$ the
window function of finite time support and $s$ the signal. In such a
setting the Fourier Transform (FT) at frequency $\omega$ can be written
as:

$$S_w(\omega) = \langle s, w e^{j\omega t} \rangle. \tag{3}$$

If the signal is modelled as a generalised sinusoid:

$$s(t) = e^{r(t)}, \quad r(t) = \sum_{k=0}^{K} r_k t^k, \quad r_k \in \mathbb{C}, \tag{4}$$

the following equality:

$$\langle r' s, w e^{j\omega t} \rangle = -\langle s, w' e^{j\omega t} \rangle + j\omega \langle s, w e^{j\omega t} \rangle, \tag{5}$$

can be compacted (using (3)) into:

$$\sum_{k=1}^{K} k\, r_k\, S_{t^{k-1} w}(\omega) = -S_{w'}(\omega) + j\omega S_w(\omega), \tag{6}$$

where $\langle s t^k, w e^{j\omega t} \rangle = \langle s, (t^k w) e^{j\omega t} \rangle = S_{t^k w}$.
The above holds for any $\omega$ and can thus be used to define a
linear system of equations with respect to $r_k$, $k > 0$; however,
$r_0$ cannot be estimated this way as it was factored out during
derivation. The frequencies used to construct the linear system can
generally be arbitrary, though one should choose the ones that bear
most of the energy of the sinusoids under study to avoid numerical
instabilities. The set of frequencies should also cover a large part of
the bandwidth occupied by the sinusoid: failure to do so would exclude
important frequency domain content of the sinusoid, leading to
inaccurate estimation.
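
As a rough illustration of how such a linear system can be assembled, the following sketch estimates $r_1$ and $r_2$ for a degree-2 model from a single frame. It is a minimal example under stated assumptions rather than the implementation used in this paper: the function name `ddm_linear_fm`, the Hann window with its analytic derivative, the least-squares solve and the synthetic test chirp are all illustrative choices.

```python
import numpy as np

def ddm_linear_fm(s, fs, bins):
    """Estimate r_1, r_2 of s(t) = exp(r_0 + r_1 t + r_2 t^2) from one frame.

    Builds one equation r_1 S_w + 2 r_2 S_tw = -S_w' + j omega S_w per
    frequency bin and solves the stacked system in the least-squares sense.
    """
    n = len(s)
    t = (np.arange(n) - (n - 1) / 2) / fs            # time axis centred on the frame (s)
    T = n / fs                                       # frame duration (s)
    w = 0.5 * (1 + np.cos(2 * np.pi * t / T))        # Hann window, near zero at the edges
    dw = -(np.pi / T) * np.sin(2 * np.pi * t / T)    # analytic window derivative w'(t)
    omega = 2 * np.pi * np.asarray(bins) * fs / n    # FFT bin frequencies (rad/s)
    kernel = np.exp(-1j * np.outer(omega, t))        # conjugated e^{j omega t}, one row per bin
    S_w = kernel @ (s * w)
    S_dw = kernel @ (s * dw)
    S_tw = kernel @ (s * t * w)
    A = np.column_stack([S_w, 2 * S_tw])
    b = -S_dw + 1j * omega * S_w
    r1, r2 = np.linalg.lstsq(A, b, rcond=None)[0]
    return r1, r2

# Synthetic complex chirp (illustrative values): 4 kHz centre, 50 kHz/s slope.
fs, n = 44100, 1024
f0, chirp = 4000.0, 50e3
t = (np.arange(n) - (n - 1) / 2) / fs
s = np.exp(1j * 2 * np.pi * (f0 * t + 0.5 * chirp * t ** 2))

peak = int(round(f0 * n / fs))                       # bin nearest the centre frequency
r1, r2 = ddm_linear_fm(s, fs, np.arange(peak - 8, peak + 8))   # 16 consecutive bins
f_est = r1.imag / (2 * np.pi)                        # frequency estimate (Hz)
slope_est = r2.imag / np.pi                          # frequency slope estimate (Hz/s)
```

With this parameterisation the instantaneous frequency at the frame centre is $\Im(r_1)/2\pi$ and its rate of change is $\Im(r_2)/\pi$.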

The complex stationary parameter $r_0$ can be estimated after the
non-stationary parameters $r_k$, $k > 0$, have been estimated.
Substituting [...] yields:

[Equation (10) is garbled in the source.]

The parameters $r_k$, $k > 0$, in (10) are substituted with the
estimates $\hat{r}_k$, $k > 0$, to get the estimate for $r_0$. For most
applications the model with an FM polynomial of degree 2 (i.e. the
frequency change is linear during the observation frame) is sufficient.
In such a case only the values $S_w$, $S_{w'}$ and $S_{tw}$ at
different frequencies form the linear system. The widest mainlobe width
is that of $w(t)t$, 5 bins in total. In order to select an optimal
bandwidth for the DDM, a typical bird frequency change must be
estimated from a real recording. For the Chiffchaff sounds considered
in this paper, a typical chirp exhibits a maximum of 100 kHz/s
frequency change; thus for an observation frame of 1024 samples and a
sampling frequency of 44100 Hz, such a chirp would cover a bandwidth of
roughly 2300 Hz. It will be shown that covering a region of about
1000 Hz is sufficient to estimate the linear FM accurately enough for
the purpose.

Fig. 1. Top: spectrogram. Bottom: DDM spectrogram with linear freq
polynomial of magnitude peak superimposed.

To use the DDM efficiently we use the bins of the FFT: computing the
DFT at an arbitrary frequency has little advantage and increases
computational load significantly. For this paper a bandwidth of 16 bins
was considered. The cumulative effective range of all mainlobes
positioned at 16 consecutive bins therefore totals 21 bins (since the
widest mainlobe width is 5 bins), almost 1000 Hz in the current
setting.
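
The bandwidth figures quoted in this section follow from simple arithmetic on the frame size and sampling rate. The short check below reproduces them; the variable names are illustrative only.

```python
fs = 44100            # sampling frequency (Hz)
n = 1024              # observation frame length (samples)
chirp_rate = 100e3    # maximum chirp rate quoted for the Chiffchaff (Hz/s)

frame_duration = n / fs                  # about 23.2 ms
sweep = chirp_rate * frame_duration      # bandwidth swept within one frame (Hz)

bin_width = fs / n                       # about 43.07 Hz per FFT bin
effective_bins = 16 + 5                  # 16 consecutive bins plus the widest mainlobe
effective_bandwidth = effective_bins * bin_width

print(f"sweep per frame: {sweep:.0f} Hz")
print(f"effective DDM bandwidth: {effective_bandwidth:.0f} Hz")
```

The sweep comes out at a little over 2300 Hz and the effective range at a little over 900 Hz, matching the "roughly 2300 Hz" and "almost 1000 Hz" stated in the text.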

The estimates depend on the frequency range examined, so we designate
them as $\hat{r}_k(\omega_L, \omega_H)$. The frequency and amplitude
estimates $\Im(\hat{r}_1(\omega_L, \omega_H))$,
$\Re(\hat{r}_0(\omega_L, \omega_H))$ are of particular interest, as a
time-frequency representation (TFR) based on reassignment [15] can be
constructed from them. The frequency estimate is generally not an exact
bin value: quantising the frequency estimates and summing the
corresponding amplitude estimates results in a TFR very similar to the
reassigned spectrogram [15], which we call the DDM spectrogram. It will
be shown that such a TFR exhibits desirable properties, especially when
combined with the linear frequency change estimate $\Im(r_2)$.
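
The quantise-and-sum step can be sketched for a single frame as follows. This is an illustrative sketch, not the authors' code: the function name `accumulate_frame` is hypothetical, and it assumes the amplitude estimates have already been converted to linear magnitudes (for instance via the exponential of the log-amplitude estimate).

```python
import numpy as np

def accumulate_frame(freq_est_hz, amp_est, fs, n_fft):
    """Sum amplitude estimates into the FFT bin nearest each frequency estimate."""
    n_bins = n_fft // 2 + 1
    frame = np.zeros(n_bins)
    idx = np.rint(np.asarray(freq_est_hz) * n_fft / fs).astype(int)  # quantise to bins
    valid = (idx >= 0) & (idx < n_bins)          # discard estimates outside the spectrum
    # np.add.at accumulates correctly when several estimates land in the same bin
    np.add.at(frame, idx[valid], np.asarray(amp_est)[valid])
    return frame

# Two estimates near 4 kHz fall in the same bin; one near 8 kHz falls elsewhere.
frame = accumulate_frame([4000.0, 4010.0, 8000.0], [1.0, 0.5, 2.0], 44100, 1024)
```

Stacking such frames over time gives a spectrogram-like array in which energy is concentrated at the estimated frequencies rather than spread over the window mainlobe.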

3. MULTIPLE TRACKING WITH A MARKOV 
RENEWAL PROCESS MODEL 

The task of tracking multiple sound sources in an acoustic scene may be
approached using established multiple-tracking paradigms [16]. However,
such models do not account for structured patterns of emission and
silence, as is common in
many sound types including birdsong. For example, the factorial hidden
Markov model does not formally model gaps, although silent states can
be added to the representation [17]; however, it assumes an unchanging
number of sources.

In order to track a varying number of intermittent sources, [10]
introduced a multiple tracking model with sources modelled as instances
of a Markov renewal process. A Markov renewal process is a point
process in which the current state stochastically determines the
following state, as well as the time gap between them:

$$P(\tau_{n+1} \le t, X_{n+1} = j \mid (X_1, T_1), \ldots, (X_n = i, T_n)) = P(\tau_{n+1} \le t, X_{n+1} = j \mid X_n = i)$$
$$\forall n \ge 1,\ t \ge 0,\ i, j \in S, \tag{11}$$

where observations are received in the form $\{(X, T)\}$ with state $X$
and time $T$, and $\tau_{n+1}$ is the time difference $T_{n+1} - T_n$.
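
To make the definition concrete, the sketch below draws a sequence from a toy Markov renewal process with two states. The transition matrix, the exponential gap distributions and the function name `sample_mrp` are assumptions made purely for illustration; the model in this paper does not prescribe them.

```python
import numpy as np

def sample_mrp(trans, gap_means, x0, n_events, rng):
    """Draw (state, time) pairs; the next state and the gap depend only on the current state."""
    states, times = [x0], [0.0]
    for _ in range(n_events - 1):
        i = states[-1]
        j = int(rng.choice(len(trans), p=trans[i]))   # next state given current state i
        tau = rng.exponential(gap_means[i][j])        # time gap, conditioned on (i, j)
        states.append(j)
        times.append(times[-1] + tau)
    return list(zip(states, times))

trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])            # hypothetical state transition probabilities
gap_means = np.array([[0.1, 0.5],
                      [0.5, 0.1]])        # hypothetical mean gaps in seconds
events = sample_mrp(trans, gap_means, 0, 50, np.random.default_rng(0))
```

Each element of `events` is an observation $(X, T)$; the gaps between consecutive times play the role of $\tau$.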

Note that $\tau$ is known if the observations represent a single
sequence, but if the observations may represent multiple sequences as
well as clutter noise then the causal structure is unknown and $\tau$
is hidden. In that case we can estimate the structure by choosing a
partitioning of the data into $K$ clusters plus $H$ noise events so as
to maximise the likelihood

$$\prod_{k=1}^{K} p_{\mathrm{MRP}}(k) \prod_{\eta=1}^{H} p_{\mathrm{noise}}(\eta),$$

where $p_{\mathrm{MRP}}(k)$ represents the likelihood of the
observation subsequence in cluster $k$ being generated by a single MRP,
with internal transition likelihoods as in (11), and
$p_{\mathrm{noise}}(\eta)$ represents the likelihood of a single noise
datum. In this multiple Markov renewal process (MMRP) setting,
inferring the maximum likelihood solution is a combinatorial problem
which can be addressed via graph-theoretic techniques [10].
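
The objective can be illustrated by scoring candidate partitions directly. The sketch below is a toy version under strong assumptions: the single-MRP likelihood is reduced to a Gaussian on each inter-event gap (ignoring the state $X$), the noise likelihood is a constant, and all names and parameter values are hypothetical. It only evaluates the likelihood of a given partition and does not perform the graph-theoretic search of [10].

```python
import math

def log_gauss(x, mu, sigma):
    """Log-density of a Gaussian, used here as a toy gap likelihood."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_p_mrp(event_times, gap_mu=0.2, gap_sigma=0.05):
    """Toy single-sequence log-likelihood: sum over consecutive gaps."""
    return sum(log_gauss(b - a, gap_mu, gap_sigma)
               for a, b in zip(event_times, event_times[1:]))

def partition_log_likelihood(clusters, noise_events, log_p_noise=math.log(0.1)):
    """Log of the product over clusters of p_MRP times the product over noise events of p_noise."""
    return sum(log_p_mrp(c) for c in clusters) + len(noise_events) * log_p_noise

# Two interleaved sources, each emitting every 0.2 s.
good = partition_log_likelihood([[0.0, 0.2, 0.4], [0.1, 0.3, 0.5]], [])
bad = partition_log_likelihood([[0.0, 0.1, 0.4], [0.2, 0.3, 0.5]], [])
```

The partition that respects the regular gaps scores higher than the one that mixes the two sources, which is the behaviour the maximisation relies on.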

The MMRP inference technique does not operate directly 
on audio, but takes a set of timestamped event detections as 
input. In [10] the authors describe an experiment applied to
birdsong, in which they use a simple cross-correlation signal 
detection technique applied to spectrogram data as the pre- 
processing step for their analysis. They observe that this step 
may be a limiting factor in overall performance, in part be- 
cause the cross-correlation may not recover the same detec- 
tions from audio mixtures as from monophonic audio, and in 
part because each detection has only a simple state represen- 
tation (frequency offset of the template match). 

Various refinements might be tried to improve on the results of [10],
such as alternative event detection based on dictionary learning
techniques or sinusoidal modelling. However, in this work we will use
the same detection technique as the previous authors, and demonstrate
that using the distribution derivative method (DDM) of Section 2
improves the recovery of birdsong sequences within the same workflow,
by improving the spectrotemporal detail in the underlying
representation.



4. EXPERIMENTS 

To validate the MMRP inference, the authors in [10] apply
it to a dataset of birdsong audio files, using the solo files 
as training data and mixed audio files to test segregation of 
bird sounds into separate streams. We ran the same experi- 
ment, varying the underlying spectrogram representation and 
the amount of detail passed on to the later processing stages. 

The dataset consists of individual recordings of the Common Chiffchaff
(Phylloscopus collybita) collected around Europe and submitted to the
public database Xeno Canto.¹ The specific recordings used are available
online.² For each recording, as well as for the synthetic mixtures, we
analysed the audio via a standard STFT spectrogram (sample rate
44.1 kHz, frame size 1024, 50% overlap, Hann window), and separately
via the DDM spectrogram described in Section 2.

We used the same cross-correlation template-matching 
paradigm as [10] to detect individual syllables of birdsong.
To detect syllables from the standard spectrogram we used 
the same manually-specified template. Additionally we tested 
two strategies to detect syllables from the DDM spectro- 
gram: the standard 2D time-frequency template, or a 3D time- 
frequency-FM template created by augmenting the template 
with a third dimension representing the FM values expected 
in the syllable. These FM values were calculated from the 
frequency slope implied by the shape of the template. 
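
A 2D template match of this kind can be sketched as a sliding correlation over the spectrogram. The version below uses a normalised correlation score, which is an assumption of this example (the exact scoring used in [10] is not specified here), and the function name `match_template` and the toy down-chirp template are hypothetical.

```python
import numpy as np

def match_template(spec, template):
    """Slide a template over a (frequency x time) array; return a normalised correlation map."""
    tf, tt = template.shape
    tz = (template - template.mean()) / (template.std() + 1e-12)
    out = np.zeros((spec.shape[0] - tf + 1, spec.shape[1] - tt + 1))
    for f in range(out.shape[0]):
        for t in range(out.shape[1]):
            patch = spec[f:f + tf, t:t + tt]
            pz = (patch - patch.mean()) / (patch.std() + 1e-12)
            out[f, t] = np.mean(tz * pz)         # 1.0 for a perfect match
    return out

template = np.eye(5)[::-1]                       # toy chirp: an anti-diagonal ridge
spec = np.zeros((20, 30))
spec[7:12, 12:17] = template                     # embed one syllable at offset (7, 12)
score = match_template(spec, template)
best = np.unravel_index(np.argmax(score), score.shape)
```

Detections are then the peaks of `score` above a threshold, with the frequency offset of each peak serving as the state of the detection. A time-frequency-FM template could be handled analogously by stacking the FM values as an additional array dimension.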

By testing these three variants of the template-matching 
process (STFT, DDM, DDM with FM information), we could 
evaluate whether improving the detection could have positive 
effects on the birdsong segregation. However, we also wanted 
to investigate whether the MMRP segregation process would 
be improved by giving it access to a more detailed representation of
each detected syllable. To that end, we tested the approach of [10],
which encodes each syllable state X simply as a single frequency-offset
value, against a modified approach in which spectral detail from within
the detection region is encoded as a vector-valued state. For each
detected syllable, we
determined a simple feature representing the time evolution of 
the spectral energy in the detection region: the frequency of 
the peak bin in each frame, downsampled by a factor of four 
to alleviate curse-of-dimensionality concerns. This produced 
a vector of five frequency values for each detection. Our hy- 
pothesis was that this richer representation would allow the 
MMRP inference to make clearer distinctions between true 
and false transitions, and thus improve the performance. 
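
The vector-valued state can be sketched as follows. The example assumes a detection region of 20 frames and downsampling by keeping every fourth frame, which yields five values; the actual region length and downsampling scheme are not stated in the text, and the function name `peak_track_feature` is hypothetical.

```python
import numpy as np

def peak_track_feature(region, bin_freqs_hz, factor=4):
    """Frequency of the peak bin in each frame, keeping every `factor`-th frame."""
    peak_bins = np.argmax(region, axis=0)        # region is (frequency bins x frames)
    return bin_freqs_hz[peak_bins][::factor]

# Toy detection region: 10 bins x 20 frames with a slowly rising peak.
region = np.zeros((10, 20))
for frame in range(20):
    region[frame // 4 + 2, frame] = 1.0
bin_freqs_hz = np.arange(10) * 43.0              # illustrative bin centre frequencies
feature = peak_track_feature(region, bin_freqs_hz)
```

Here `feature` has five entries, one per retained frame, tracing the frequency trajectory of the syllable.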

In our tests we used the same baseline and gold-standard systems as in
[10]. The baseline was based on a Gaussian mixture model (GMM)
signal-vs-noise classifier which does not make use of transition
likelihoods; the gold standard was not to use the mixture audio files
as input, but to combine the detections produced from the monophonic
file analysis, representing the event detections that would be
recovered in the ideal case that detections from an audio mixture are
the same as those from the separate audio signals. However, our main
point of comparison was between the different underlying
representations, to examine whether the improved spectrogram and/or the
more detailed output improves performance.

¹ http://www.xeno-canto.org/europe
² http://archive.org/details/chiffchaff25

Fig. 2. F-measure statistics for signal-noise separation (Fsn, top row)
and recovery of transitions (Ftrans, bottom row). The three columns
show results using the three different signal representations: standard
STFT spectrogram (left), DDM (middle), and DDM including first-order FM
information (right). The solid black line shows performance using the
standard encoding of each detection as a single value, while the dashed
black line shows performance using the more detailed encoding with five
frequency values per syllable. Means and standard errors are shown,
five-fold crossvalidation.

As in [10], we performed five-fold crossvalidation, with
the standard F-measure as our evaluation statistic applied in 
two ways: Fsn is the F-measure for signal/noise separation, 
and Ftrans is the F-measure for recovering true event-to-event 
transitions (i.e. segregating the signal correctly into sources). 
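
For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it over sets of items, for example sets of event-to-event transitions represented as pairs of event indices; that representation and the function name are choices made for this example.

```python
def f_measure(predicted, reference):
    """Harmonic mean of precision and recall between two collections of items."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

# Transitions written as (from_event, to_event) index pairs.
reference = {(0, 1), (1, 2), (3, 4)}
predicted = {(0, 1), (1, 4), (3, 4)}
f_trans = f_measure(predicted, reference)
```

In this toy case two of the three predicted transitions are correct, so precision and recall are both 2/3 and so is the F-measure.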

Results are shown in Figure 2. It is evident from the graphs that
performance improves from the left plots to the right plots: using DDM
rather than the STFT spectrogram improves performance, and using DDM
with the FM information included in the detection step improves it
further still. This applies for both Fsn and Ftrans. (Interestingly,
the use of DDM with FM information also improves the performance of the
baseline non-MMRP inference.) However, passing the more detailed state
representation in to the MMRP inference (the solid lines vs. the dashed
lines) appears to improve Fsn without notably changing Ftrans.

We confirmed these observations using a repeated-measures ANOVA test.
For each evaluation measure we entered three factors: the spectrogram
type, the state representation, and the number of signals in the
mixture. For Fsn, significant effects were found for all three factors
(each significant at p < 0.006). For Ftrans, significant effects were
found for the spectrogram type and the number of signals in the mixture
(each p < 0.007), but the state representation was not significant
(p = 0.056). For both evaluation measures, a significant two-way
interaction was also found for spectrogram type combined with number of
signals (p < 0.007).

Overall, in this experiment we achieved improvements of around 20
percentage points in both Fsn and Ftrans, using a combination of the
DDM spectrogram, the use of FM information in template-matching, and
passing a more detailed state representation to the source-segregation
stage.

5. CONCLUSIONS 

We have considered a maximum-likelihood technique for 
tracking multiple singing birds in an audio recording, and 
demonstrated that it can benefit strongly from an improved 
underlying spectrogram representation. We applied a variant 
of the DDM technique, using a range of spectral bins to infer 
fine detail about modulated sinusoids, which is particularly 
pertinent in the case of birdsong because of the presence of 
rapid pitch modulations. We also demonstrated that passing a rich
feature representation to the later inference stage improves tracking.
Altogether, our modifications yield an improvement of approximately 20
percentage points in the F-measure.